Classifier performances for deriving replicates

For each disease, we derive replicates of the mapping of RCTs across diseases by simulating what the mapping of RCTs within regions would have been if the misclassification of RCTs towards groups of diseases were corrected, given the sensitivity and specificity of the classifier for each group of diseases.

To estimate the performance of the classifier for each group of diseases, we use a test set of 2,763 trials manually classified according to the 27-class grouping of diseases used in this work. The test set is described in Atal et al., BMC Bioinformatics 2016.

This script calculates the sensitivity and specificity of the classifier to identify each disease and to identify other studies relevant to the burden of diseases, together with the number of successes and the number of trials needed to derive beta distributions.
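As a minimal sketch of how counts of successes and trials can lead to beta distributions, the example below uses the Tuberculosis counts from the performance table printed later in this notebook. The Beta(successes + 1, failures + 1) parameterisation (a uniform prior) and the names `n_rep`, `sens_rep` and `spec_rep` are assumptions made for illustration; the notebook itself does not state which parameterisation is used downstream.

```r
# Counts for Tuberculosis, taken from the performance table in this notebook
tp <- 14; fp <- 2; tn <- 2745; fn <- 2

# Sensitivity: successes = TP, trials = TP + FN
sens <- tp / (tp + fn)
# Specificity: successes = TN, trials = TN + FP
spec <- tn / (tn + fp)

# Assumption: Beta(successes + 1, failures + 1), i.e. a uniform prior.
# The seed and the number of replicates are arbitrary illustrative values.
set.seed(1)
n_rep <- 1000
sens_rep <- rbeta(n_rep, tp + 1, fn + 1)
spec_rep <- rbeta(n_rep, tn + 1, fp + 1)

round(c(sensitivity = sens, specificity = spec), 4)
```

Each element of `sens_rep` and `spec_rep` could then serve as one plausible value of the classifier's performance when simulating a replicate.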

1. Sensitivities and specificities based on test set


In [1]:
test_set <- read.table("/media/igna/Elements/HotelDieu/Cochrane/MetaMapBurden/Paper_classifier/NCT_data_classified_to28cats.txt")
dim(test_set)


  1. 2763
  2. 8

In [2]:
#We suppress injuries (category 28) from the trials relevant to the burden of diseases
test_set$GBDnp <- sapply(strsplit(as.character(test_set$GBDnp),"&&"),function(x){paste(x[x!="28"],collapse="&")})
test_set$GBD28 <- sapply(strsplit(as.character(test_set$GBD28),"&"),function(x){paste(x[x!="28"],collapse="&")})
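The effect of this step can be checked on a few made-up label strings. The values in `labels` are hypothetical and only illustrate the split, filter and re-join pattern; they are not taken from the test set.

```r
# Hypothetical label strings; "28" is the category being removed
labels <- c("12&28", "28", "3&14")

# Split on "&", drop category 28, and join the remaining categories again
cleaned <- sapply(strsplit(labels, "&", fixed = TRUE),
                  function(x) paste(x[x != "28"], collapse = "&"))
cleaned

# Converting back to numeric label sets: a trial left with no category
# becomes an empty vector
label_sets <- lapply(strsplit(cleaned, "&", fixed = TRUE), as.numeric)
lengths(label_sets)
```

A trial whose only category was 28 ends up with an empty label set, so it no longer counts as relevant to any of the 27 remaining groups.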

In [3]:
tst <- strsplit(test_set$GBDnp,"&")
alg <- strsplit(test_set$GBD28,"&")
tst <- lapply(tst,as.numeric)
alg <- lapply(alg,as.numeric)

In [4]:
source('Evaluation_metrics.R')

In [5]:
dis <- 1:27
Mgbd <- read.table("/home/igna/Desktop/Programs GBD/Classifier_Trial_GBD/Databases/Taxonomy_DL/GBD_data/GBD_ICD.txt")

In [6]:
#For each category in 1:27, TP, TN, FP and FN of finding the disease and of finding another disease
set.seed(7212)

dis <- as.character(1:27)

PERF_F  <- data.frame()
for(i in dis){
    ALG <- lapply(alg,function(x){rs <- c()
                                  if(i%in%x) rs <- c(1)
                                  if(sum(setdiff(dis,i)%in%x)!=0) rs <- c(rs,2)
                                  return(rs)
                                      })

    DT <- lapply(tst,function(x){rs <- c()
                                if(i%in%x) rs <- c(1)
                                if(sum(setdiff(dis,i)%in%x)!=0) rs <- c(rs,2)
                                return(rs)
                                    })

    CM <- conf_matrix(ALG,DT,c(1,2))

    PERF <- c(CM[1,],CM[2,])
    PERF_F <- rbind(PERF_F,PERF)
}
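The loop above recodes every trial relative to one disease at a time: label 1 means the disease of interest is present, label 2 means at least one other disease is present. The sketch below reproduces that recoding on toy data. Because the source of `conf_matrix()` (from `Evaluation_metrics.R`) is not shown here, `count_label()` is a hypothetical stand-in that counts TP, FP, TN and FN for a single label; it may differ from the real function.

```r
# Toy label sets (hypothetical): one vector of GBD categories per trial
alg_toy <- list(c(1, 12), 12, numeric(0), 1)   # classifier output
tst_toy <- list(1, 12, 3, c(1, 12))            # manual reference
dis_toy <- 1:27

# Recode a label set relative to disease i:
# 1 = disease i present, 2 = at least one other disease present
recode <- function(x, i) {
  rs <- c()
  if (i %in% x) rs <- c(rs, 1)
  if (any(setdiff(dis_toy, i) %in% x)) rs <- c(rs, 2)
  rs
}

# Hypothetical stand-in for conf_matrix(): counts for one label
count_label <- function(pred, ref, label) {
  p <- vapply(pred, function(x) label %in% x, logical(1))
  r <- vapply(ref,  function(x) label %in% x, logical(1))
  c(TP = sum(p & r), FP = sum(p & !r), TN = sum(!p & !r), FN = sum(!p & r))
}

i <- 1
ALG_toy <- lapply(alg_toy, recode, i = i)
DT_toy  <- lapply(tst_toy, recode, i = i)

res_dis <- count_label(ALG_toy, DT_toy, 1)  # finding disease i
res_oth <- count_label(ALG_toy, DT_toy, 2)  # finding another disease
rbind(Dis = res_dis, Oth = res_oth)
```

For each label the four counts sum to the number of trials, which is a quick sanity check that also holds for every row of the full table (2,763 trials).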

In [7]:
#We add the performance of the classifier in identifying trials relevant to the burden of diseases
ALG <- lapply(alg,length)
DT <- lapply(tst,length)
CM <- conf_matrix(ALG,DT,1)
PERF <- c(CM,rep(NA,4))
PERF_F <- rbind(PERF_F,PERF)
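A simplified view of this last row, assuming a trial counts as relevant to the burden of diseases when it has at least one of the 27 categories. This is an interpretation of the step, not a description of how `conf_matrix()` handles the lengths internally, and the toy label sets are hypothetical.

```r
# Toy label sets (hypothetical); an empty vector means no category assigned
alg_toy <- list(c(1, 12), numeric(0), 3, numeric(0))  # classifier output
tst_toy <- list(1, 5, numeric(0), numeric(0))         # manual reference

# Assumption: relevant to the burden of diseases = at least one category
pred_rel <- lengths(alg_toy) > 0
ref_rel  <- lengths(tst_toy) > 0

rel_counts <- c(TP = sum(pred_rel & ref_rel),
                FP = sum(pred_rel & !ref_rel),
                TN = sum(!pred_rel & !ref_rel),
                FN = sum(!pred_rel & ref_rel))
rel_counts
```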

In [8]:
PERF_F <- data.frame(PERF_F)
names(PERF_F) <- paste(rep(c("TP","FP","TN","FN"),2),rep(c("_Dis","_Oth"),each=4),sep="")

In [9]:
PERF_F$dis <- c(dis,0)
PERF_F$GBD <- c(as.character(Mgbd$cause_name[-28]),"All")

In [10]:
PERF_F <- PERF_F[,c(9,10,1:8)]

In [11]:
PERF_F


| | dis | GBD | TP_Dis | FP_Dis | TN_Dis | FN_Dis | TP_Oth | FP_Oth | TN_Oth | FN_Oth |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Tuberculosis | 14 | 2 | 2745 | 2 | 2142 | 204 | 267 | 150 |
| 2 | 2 | HIV/AIDS | 86 | 7 | 2659 | 11 | 2072 | 214 | 333 | 144 |
| 3 | 3 | Diarrhea, lower respiratory infections, meningitis, and other common infectious diseases | 40 | 21 | 2693 | 9 | 2113 | 207 | 299 | 144 |
| 4 | 4 | Malaria | 14 | 1 | 2748 | 0 | 2142 | 204 | 267 | 150 |
| 5 | 5 | Neglected tropical diseases excluding malaria | 6 | 0 | 2756 | 1 | 2150 | 203 | 261 | 149 |
| 6 | 6 | Maternal disorders | 17 | 5 | 2715 | 26 | 2130 | 210 | 289 | 134 |
| 7 | 7 | Neonatal disorders | 4 | 7 | 2746 | 6 | 2148 | 205 | 262 | 148 |
| 8 | 8 | Nutritional deficiencies | 11 | 15 | 2732 | 5 | 2140 | 201 | 272 | 150 |
| 9 | 9 | Sexually transmitted diseases excluding HIV | 0 | 3 | 2759 | 1 | 2155 | 203 | 255 | 150 |
| 10 | 10 | Hepatitis | 14 | 4 | 2742 | 3 | 2141 | 208 | 262 | 152 |
| 11 | 11 | Leprosy | 2 | 1 | 2760 | 0 | 2154 | 203 | 256 | 150 |
| 12 | 12 | Neoplasms | 933 | 42 | 1763 | 25 | 1213 | 214 | 1198 | 138 |
| 13 | 13 | Cardiovascular and circulatory diseases | 178 | 60 | 2468 | 57 | 1951 | 217 | 466 | 129 |
| 14 | 14 | Chronic respiratory diseases | 76 | 17 | 2665 | 5 | 2074 | 209 | 328 | 152 |
| 15 | 15 | Cirrhosis of the liver | 19 | 17 | 2723 | 4 | 2133 | 211 | 267 | 152 |
| 16 | 16 | Digestive diseases (except cirrhosis) | 24 | 28 | 2703 | 8 | 2129 | 199 | 289 | 146 |
| 17 | 17 | Neurological disorders | 79 | 40 | 2630 | 14 | 2060 | 211 | 339 | 153 |
| 18 | 18 | Mental and behavioral disorders | 134 | 33 | 2587 | 9 | 2014 | 198 | 402 | 149 |
| 19 | 19 | Diabetes, urinary diseases and male infertility | 196 | 63 | 2458 | 46 | 1930 | 213 | 473 | 147 |
| 20 | 20 | Gynecological diseases | 9 | 8 | 2744 | 2 | 2146 | 206 | 262 | 149 |
| 21 | 21 | Hemoglobinopathies and hemolytic anemias | 10 | 4 | 2743 | 6 | 2143 | 203 | 270 | 147 |
| 22 | 22 | Musculoskeletal disorders | 100 | 40 | 2610 | 13 | 2046 | 188 | 382 | 147 |
| 23 | 23 | Congenital anomalies | 22 | 34 | 2706 | 1 | 2121 | 205 | 275 | 162 |
| 24 | 24 | Skin and subcutaneous diseases | 18 | 24 | 2717 | 4 | 2134 | 198 | 281 | 150 |
| 25 | 25 | Sense organ diseases | 52 | 40 | 2667 | 4 | 2085 | 190 | 322 | 166 |
| 26 | 26 | Oral disorders | 3 | 4 | 2751 | 5 | 2150 | 207 | 258 | 148 |
| 27 | 27 | Sudden infant death syndrome | 0 | 0 | 2763 | 0 | 2156 | 203 | 254 | 150 |
| 28 | 0 | All | 2022 | 165 | 314 | 262 | NA | NA | NA | NA |

In [12]:
write.csv(PERF_F,'Tables/Performances_per_27disease_data.csv')
